Code and data files for this project are published on GitHub: cjunwon/STA-207
Initial report can be downloaded from here.
To skip to key conclusions from the initial analysis and new results, click here.
Original data source: Harvard Dataverse: Tennessee’s Student Teacher Achievement Ratio (STAR) project
Project STAR, short for the Student/Teacher Achievement Ratio Project, emerged in the mid-1980s as a pioneering effort to explore the relationship between class size and student academic outcomes. Motivated by policy concerns regarding the efficacy of smaller classes, the study was funded by the Tennessee General Assembly and implemented as a randomized controlled trial. In this experiment, students and teachers were randomly assigned to one of three classroom environments—small classes (13–17 students), regular classes (22–25 students), and regular classes with a teacher’s aide. This design was intended to isolate the effect of class size on academic performance while controlling for potential confounders, thereby providing strong evidence on the causal impacts of educational settings.
Beyond its initial focus on early childhood education, Project STAR was designed as a longitudinal study that followed students from kindergarten through third grade, and later into high school. This extended follow-up allowed researchers to investigate long-term outcomes, including high school achievement, graduation rates, and preparedness for higher education. The extensive dataset, which encompasses detailed academic records, teacher assessments, and demographic information, has been invaluable in shaping educational policy and research. By systematically analyzing the long-term effects of early educational interventions, Project STAR has contributed significantly to our understanding of how classroom environments influence academic trajectories and overall student success.
library(dplyr)
library(ggplot2)
library(ggthemes)
library(knitr)
library(kableExtra)
library(patchwork)
library(car)
library(MASS)
library(broom)
library(AER)
library(foreign)
library(forcats)
library(tidyr)
library(ggalluvial)
library(plotly)
STAR_Students <- read.spss('dataverse_files/PROJECT STAR/STAR_Students.sav', to.data.frame=TRUE)
# Comparison_Students <- read.spss('dataverse_files/PROJECT STAR/Comparison_Students.sav', to.data.frame=TRUE)
STAR_K3_Schools <- read.spss('dataverse_files/PROJECT STAR/STAR_K-3_Schools.sav', to.data.frame=TRUE)
STAR_High_Schools <- read.spss('dataverse_files/PROJECT STAR/STAR_High_Schools.sav', to.data.frame=TRUE)
This investigation utilizes the STAR-and-Beyond database from the Harvard Dataverse, which contains detailed information on students, teachers, and schools involved in Project STAR. The dataset includes records from the original STAR study, as well as follow-up data from high school if available.
The primary student-level data file contains information on 11,601 students who participated in the experimental phase for at least one year between 1985 and 1989. Information for each of grades K-3 includes:
As part of the extended follow-up, the records of some or all students additionally include:
Note: This investigation does not necessarily encompass all variables in the dataset, but rather focuses on key areas of interest related to class size and student achievement (discussed in the subsequent sections).
Project STAR was initiated following the passage of House Bill (HB) 544 by the Tennessee Legislature in May 1985, which directed an investigation into the effects of class size on student achievement and development in the primary grades (K–3). The legislation outlined three primary research questions:
To implement this study, the Tennessee State Department of Education established a research consortium involving representatives from the Department, the State Board of Education, the State Superintendents’ Association, and four Tennessee universities. The study adhered to an experimental design, randomly assigning students entering kindergarten in 1985 or first grade in 1986 to one of three class conditions:

- Small classes (13–17 students)
- Regular classes (22–25 students)
- Regular classes with a teacher’s aide
Randomization was executed by consortium members and supervised locally by university-affiliated graduate students, ensuring unbiased assignment based on gender, race, and socioeconomic status.
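The within-school random assignment described above can be sketched in R. This is a minimal illustration on a hypothetical roster (the `students` data frame and its columns are assumptions for demonstration, not STAR variables), not the consortium's actual procedure:

```r
library(dplyr)

# Hypothetical roster: 90 students across three schools
set.seed(1985)
students <- data.frame(
  id     = 1:90,
  school = rep(c("A", "B", "C"), each = 30)
)

# Randomly assign each student to one of the three class conditions,
# keeping the conditions balanced within each school
conditions <- c("small", "regular", "regular-aide")
students <- students %>%
  group_by(school) %>%
  mutate(class_type = sample(rep(conditions, length.out = n()))) %>%
  ungroup()

table(students$school, students$class_type)  # 10 students per condition per school
```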
All Tennessee schools were invited to participate under conditions set by the state, including the random assignment requirement, maintenance of standard school policies aside from class size adjustments, and commitment for four consecutive years. Of the initially interested 180 schools, 79 were ultimately selected from 42 districts to ensure representation of inner-city, suburban, urban, and rural settings:
Participation fluctuated slightly due to mergers and withdrawals, primarily attributed to challenges maintaining randomization and administrative burdens. Consequently, the number of participating schools ranged from 79 in kindergarten to 75 by third grade.
After the initial year, STAR administrators modified the study slightly by randomly redistributing half of the students between regular (R) and regular-aide (RA) classes for subsequent years due to no significant kindergarten performance differences found between these two groups. Small-class assignments remained unchanged. This is a caveat that is addressed in the subsequent analysis.
Teacher training occurred for a subset of second-grade teachers, with no significant difference in student achievement outcomes observed between trained and untrained teachers. Student mobility also influenced class composition, with new entrants randomly assigned while maintaining small-class constraints. This “class size drift” was documented and considered in subsequent analyses.
Academic performance was evaluated annually using the Stanford Achievement Tests (SATs) and the Tennessee Basic Skills First (BSF) tests. Student self-concept and motivation were measured using the SCAMIN inventory. Beyond third grade, additional longitudinal data were collected, including academic performance in grades 4–8 (via the Tennessee Comprehensive Assessment Program, TCAP), student participation and identification with school surveys, college entrance examination data (ACT/SAT), high school transcripts, and graduation/dropout information.
These detailed design features and rigorous methodologies positioned Project STAR as a landmark experimental study capable of robustly determining the causal impacts of class size on educational outcomes. However, the project did not come without limitations and challenges.
Despite its robust experimental design, Project STAR has several notable limitations that should be acknowledged when interpreting its findings:
Project STAR experienced considerable student mobility, resulting in many students not remaining in their assigned class types throughout the study period. Such mobility led to a phenomenon known as “class size drift,” where the actual sizes of regular classes sometimes became similar to those of small classes, potentially diluting the experimental contrast and complicating causal inference.
The purposeful selection of schools, which aimed to cover diverse geographic and socioeconomic areas within Tennessee, might limit the external validity of the findings. Specifically, Project STAR schools were slightly larger and had slightly lower initial achievement scores compared to statewide averages, raising questions about how representative the findings are for other educational contexts.
The project provided only limited teacher training, which did not specifically equip teachers to leverage smaller class sizes effectively. Additionally, training was not uniformly administered, and there was no demonstrated impact of the training itself. Thus, differences in instructional quality or consistency across classes might have influenced outcomes, independent of class size.
Although the study was longitudinal, it only maintained controlled class-size conditions through grade three, after which students returned to standard-sized classes. The analysis of longer-term effects beyond third grade thus faces challenges in isolating the direct impact of early exposure to small classes from subsequent educational experiences.
Aside from controlling for class size and the presence of aides, the study deliberately maintained “normal” school operations. This approach meant that other important classroom variables, such as teaching methods, curriculum variations, and peer dynamics, remained uncontrolled, potentially confounding the observed effects.
In the initial analysis of Project STAR data, we were primarily interested in answering the following two questions:
Primary question: Are there any differences in math scaled scores in 1st grade across class types?
Secondary question: If there are differences, which class type is associated with the highest math scaled scores in 1st grade?
To answer these questions, we adopted a two-way ANOVA model with the following structure:
\[Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\] where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)), and the index \(j\) represents the school. Here \(\mu_{..}\) is the overall mean, \(\alpha_{i}\) the class-type effect, \(\beta_{j}\) the school effect, and \(\epsilon_{ijk}\) the error term for student \(k\).
The two-way ANOVA model assumes that the errors \(\epsilon_{ijk}\) are independent and identically distributed as \(N(0, \sigma^2)\); that is, independence, normality, and homoscedasticity of the residuals.
We answered the primary question of interest by conducting an F-test to determine whether there are significant differences in math scaled scores across class types. The null and alternative hypotheses were: \[H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0 \quad \text{vs.} \quad H_a: \text{not all } \alpha_i \text{ equal } 0\]
Assumptions for the F-test include the normality of residuals and homoscedasticity, which remain the same as the two-way ANOVA model.
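As a minimal sketch of this model and its diagnostics, the following fits the two-way ANOVA and checks the assumptions on simulated data with the same structure (the `toy` data frame and its column names are placeholders, not the STAR variables):

```r
library(car)  # for leveneTest()

# Simulated data mimicking the structure Y_ijk = mu + alpha_i + beta_j + e_ijk
set.seed(207)
toy <- expand.grid(
  class_type = c("small", "regular", "regular-aide"),
  school     = paste0("S", 1:10),
  rep        = 1:15
)
class_effect  <- c(small = 8, regular = 0, `regular-aide` = 1)  # made-up effects
school_effect <- rnorm(10, sd = 12)
toy$math1 <- 520 +
  class_effect[as.character(toy$class_type)] +
  school_effect[as.integer(toy$school)] +
  rnorm(nrow(toy), sd = 25)

fit <- aov(math1 ~ class_type + school, data = toy)
anova(fit)  # F-test for the class-type effect

# Assumption checks: homoscedasticity and normality of residuals
leveneTest(math1 ~ class_type, data = toy)
qqnorm(residuals(fit)); qqline(residuals(fit))
```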
The F-test results indicated that the p-value for class type (`star1`) is less than 0.05, suggesting that there are significant differences in math scaled scores across class types. We rejected the null hypothesis and concluded that at least one class type has a significantly different mean math score compared to the others.
We implemented the Tukey HSD test to find that students in small classes have significantly higher math scores compared to students in regular classes and regular classes with an aide. However, there was no significant difference between regular classes and regular classes with an aide.
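A minimal, self-contained illustration of the Tukey HSD procedure (on simulated scores with made-up group means; a one-way fit is used here for brevity, whereas the report's model also includes school):

```r
# Simulated math scores for the three class types (group means are made up)
set.seed(42)
toy <- data.frame(
  class_type = factor(rep(c("small", "regular", "regular-aide"), each = 100)),
  math1      = c(rnorm(100, 538, 25), rnorm(100, 528, 25), rnorm(100, 529, 25))
)

fit <- aov(math1 ~ class_type, data = toy)
TukeyHSD(fit, "class_type")  # all pairwise differences with family-wise 95% CIs
```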
While the initial analysis report provided some valuable insights into the short-term effects of class size on student math scores in 1st grade, several caveats and limitations should be considered:
Limited Focus on Math Scaled Scores: The analysis primarily focused on math scaled scores as the outcome variable, neglecting other subjects or measures of student achievement. This narrow focus might not capture the full spectrum of educational outcomes influenced by class size. Utilizing scores from other subjects or broader achievement metrics could provide a more comprehensive understanding of the impact of class size on student learning. It would also allow for a more comprehensive comparison into the long-term effects of class size on student achievement.
Short-term Analysis: The initial analysis only considered the math scores of 1st-grade students, providing a snapshot of the immediate effects of class size on academic performance. While this short-term perspective is valuable, it fails to capture the long-term implications of early educational experiences. A more comprehensive analysis that tracks student outcomes over multiple grades and years would offer a more nuanced understanding of how class size influences academic trajectories. This would require a longitudinal approach that follows students beyond the early grades and into high school and beyond.
Operational Adjustments Post-1st Grade: The initial analysis did not account for the operational adjustments made after the first year of the study, such as the redistribution of students between regular and regular-aide classes. While most students assigned to small classes continued in that setting, students in regular and regular-aide classes were randomly reassigned. A 1999 update reported that class size and pupil–teacher ratio (PTR) are not the same measure, and that PTR does not influence student outcomes. Therefore, for further analysis it is more efficient and accurate to treat class size as either small or regular (with and without an aide combined).
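The suggested recoding can be done with `forcats::fct_collapse()` (forcats is already loaded above). The toy factor below uses the composite class-type labels that appear in the data; which column to recode is an assumption for illustration:

```r
library(forcats)

# Toy class-type factor with the composite labels used in the data
cls <- factor(c("SMALL", "REGULAR", "AIDE", "SMALL", "AIDE"))

# Fold regular and regular-aide classes into a single "REGULAR" level
cls2 <- fct_collapse(cls, REGULAR = c("REGULAR", "AIDE"))
table(cls2)
```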
data_alluvium <- subset(STAR_Students, select = c(gkclasstype, g1classtype, g2classtype, g3classtype))
class_levels <- c("small", "regular", "regular-aide")
data_alluvium <- data_alluvium %>%
  mutate(across(everything(), ~ factor(as.numeric(.),
                                       levels = c(1, 2, 3),
                                       labels = class_levels))) %>%
  mutate(across(everything(), ~ fct_na_value_to_level(., level = "Unknown (NA)")))
grade_mapping <- c("gkclasstype" = "Kindergarten",
"g1classtype" = "Grade 1",
"g2classtype" = "Grade 2",
"g3classtype" = "Grade 3")
pairs <- list(c("gkclasstype", "g1classtype"),
c("g1classtype", "g2classtype"),
c("g2classtype", "g3classtype"))
flow_list <- lapply(pairs, function(x) {
data_alluvium %>%
group_by(across(all_of(x))) %>%
summarise(value = n(), .groups = "drop") %>%
mutate(
source = paste(grade_mapping[x[1]], ": ", .[[x[1]]], sep = ""),
target = paste(grade_mapping[x[2]], ": ", .[[x[2]]], sep = "")
) %>%
    dplyr::select(source, target, value)  # dplyr:: needed since MASS masks select()
})
transitions <- bind_rows(flow_list)
nodes <- unique(c(transitions$source, transitions$target))
nodes_df <- data.frame(name = nodes, stringsAsFactors = FALSE)
nodes_df$index <- seq_len(nrow(nodes_df)) - 1
transitions <- transitions %>%
left_join(nodes_df, by = c("source" = "name")) %>%
rename(source_index = index) %>%
left_join(nodes_df, by = c("target" = "name")) %>%
rename(target_index = index)
get_class <- function(x) sub(".*: ", "", x)
nodes_df$class_type <- sapply(nodes_df$name, get_class)
class_colors <- c(
"small" = "#F5C310", # yellow
"regular" = "#0072B2", # blue
"regular-aide" = "#009E73", # green
"Unknown (NA)" = "#D55E00" # reddish/orange
)
nodes_df$color <- class_colors[nodes_df$class_type]
transitions$link_color <- scales::alpha(nodes_df$color[ match(transitions$source, nodes_df$name) ], 0.4)
p <- plot_ly(
type = "sankey",
orientation = "h",
arrangement = "snap",
node = list(
label = nodes_df$name,
pad = 15,
thickness = 20,
line = list(color = "black", width = 0.5),
color = nodes_df$color
),
link = list(
source = transitions$source_index,
target = transitions$target_index,
value = transitions$value,
color = transitions$link_color
)
) %>%
layout(
title = "Alluvial Diagram of Students' Transfer",
font = list(size = 10),
width = 800
)
p
data_alluvium <- STAR_Students %>%
  dplyr::select(gkclasstype, g1classtype, g2classtype, g3classtype) %>%  # MASS masks select()
drop_na() %>%
mutate(across(everything(), ~ factor(as.numeric(.),
levels = c(1, 2, 3),
labels = c("small", "regular", "regular-aide"))))
grade_mapping <- c("gkclasstype" = "Kindergarten",
"g1classtype" = "Grade 1",
"g2classtype" = "Grade 2",
"g3classtype" = "Grade 3")
pairs <- list(c("gkclasstype", "g1classtype"),
c("g1classtype", "g2classtype"),
c("g2classtype", "g3classtype"))
flow_list <- lapply(pairs, function(x) {
data_alluvium %>%
group_by(across(all_of(x))) %>%
summarise(value = n(), .groups = "drop") %>%
mutate(
source = paste(grade_mapping[x[1]], ": ", .[[x[1]]], sep = ""),
target = paste(grade_mapping[x[2]], ": ", .[[x[2]]], sep = "")
) %>%
    dplyr::select(source, target, value)  # dplyr:: needed since MASS masks select()
})
transitions <- bind_rows(flow_list)
nodes <- unique(c(transitions$source, transitions$target))
nodes_df <- data.frame(name = nodes, stringsAsFactors = FALSE)
nodes_df$index <- seq_len(nrow(nodes_df)) - 1
transitions <- transitions %>%
left_join(nodes_df, by = c("source" = "name")) %>%
rename(source_index = index) %>%
left_join(nodes_df, by = c("target" = "name")) %>%
rename(target_index = index)
get_class <- function(x) sub(".*: ", "", x)
nodes_df$class_type <- sapply(nodes_df$name, get_class)
class_colors <- c(
"small" = "#F5C310", # yellow
"regular" = "#0072B2", # blue
"regular-aide" = "#009E73" # green
)
nodes_df$color <- class_colors[nodes_df$class_type]
transitions$link_color <- scales::alpha(nodes_df$color[ match(transitions$source, nodes_df$name) ], 0.4)
p <- plot_ly(
type = "sankey",
orientation = "h",
arrangement = "snap",
node = list(
label = nodes_df$name,
pad = 15,
thickness = 20,
line = list(color = "black", width = 0.5),
color = nodes_df$color
),
link = list(
source = transitions$source_index,
target = transitions$target_index,
value = transitions$value,
color = transitions$link_color
)
) %>%
layout(
title = "Alluvial Diagram of Students' Transfer (NA Removed)",
font = list(size = 10),
width = 800
)
p
With these caveats in mind, we aim to extend the analysis of Project STAR data to explore the long-term effects of class size on student academic achievement. The natural new question of interest is:
Extended question: For students who complete both primary and secondary education with the objective of pursuing higher education (college), does the exposure to small class sizes in early education (K-3) have a significant impact on their high school academic performance and college readiness?
One more point to keep in mind is that Tennessee implemented a new student assessment system, the Tennessee Comprehensive Assessment Program (TCAP), the year STAR students entered grade 4. The TCAP assessment battery included norm-referenced tests from the Comprehensive Tests of Basic Skills (CTBS/McGraw Hill, 1989) and BSF criterion-referenced tests for each grade in reading and mathematics. Scores on these tests were made available by the Tennessee State Department of Education as students progressed from grade 4 (1989–1990) through grade 8 (1993–1994).
The user guide notes that “Scores on the CTBS are not directly comparable to those on the SATs. However, IRT scale scores were available for each CTBS subtest so that comparisons can be made meaningfully across grades 4–8.” Hence the scaled scores are valid for comparison across grades 4–8.
The new question of interest requires a subset of students who have completed both primary and secondary education and for whom we have complete data on academic performance and college readiness. To achieve this, we filter the dataset using the flags in the data file; these binary flag variables indicate participation or non-participation at each stage of data collection.
| Flag Type | Variable Name | Description |
|---|---|---|
| In-STAR Flags | `flagsgk` | In STAR in kindergarten |
| | `flagsg1` | In STAR in grade 1 |
| | `flagsg2` | In STAR in grade 2 |
| | `flagsg3` | In STAR in grade 3 |
| Achievement-data Flags | `flaggk` | Achievement data available, kindergarten |
| | `flagg1` | Achievement data available, grade 1 |
| | `flagg2` | Achievement data available, grade 2 |
| | `flagg3` | Achievement data available, grade 3 |
| | `flagg4` | Achievement data available, grade 4 |
| | `flagg5` | Achievement data available, grade 5 |
| | `flagg6` | Achievement data available, grade 6 |
| | `flagg7` | Achievement data available, grade 7 |
| | `flagg8` | Achievement data available, grade 8 |
| High School Data Flags | `flagsatact` | Valid SAT/ACT score available |
| | `flaghscourse` | At least two years of high school course data |
| | `flaghsgraduate` | Data on high school graduation status available |
Our subset requires the Achievement-data Flags and High School Data Flags to be “YES” for each student, ensuring complete data on student achievement from kindergarten through grade 8 and on high school graduation status. We note that the question of interest is specific to students who completed both primary and secondary education; therefore, any concerns about selection bias are beyond the scope of this analysis.
STAR_flag_vars <- c('flagsgk',
                    'flagsg1',
                    'flagsg2',
                    'flagsg3')
Achievement_flag_vars <- c('flaggk',
'flagg1',
'flagg2',
'flagg3',
'flagg4',
'flagg5',
'flagg6',
'flagg7',
'flagg8')
HS_College_flag_var <- c('flagsatact',
'flaghscourse',
'flaghsgraduate')
# Subset students who have Achievement_flag_vars and HS_College_flag_var columns == 'YES'
All_Grades_Students <- STAR_Students %>%
filter(if_all(all_of(Achievement_flag_vars), ~ . == "YES") &
if_all(all_of(HS_College_flag_var), ~ . == "YES"))
Achievement_data_students <- STAR_Students %>%
filter(if_all(all_of(Achievement_flag_vars), ~ . == "YES"))
HS_College_data_students <- STAR_Students %>%
filter(if_all(all_of(HS_College_flag_var), ~ . == "YES"))
student_count <- nrow(STAR_Students)
complete_student_count1 <- nrow(Achievement_data_students)
incomplete_student_count1 <- student_count - complete_student_count1
counts1 <- c(complete_student_count1, incomplete_student_count1)
percent1 <- round(100 * counts1 / student_count, 1)
labels1 <- paste(c("Data Available", "Data \n Unavailable"),
"\nCount:", counts1,
"\n", percent1, "%")
complete_student_count2 <- nrow(HS_College_data_students)
incomplete_student_count2 <- student_count - complete_student_count2
counts2 <- c(complete_student_count2, incomplete_student_count2)
percent2 <- round(100 * counts2 / student_count, 1)
labels2 <- paste(c("Data Available", "Data \n Unavailable"),
"\nCount:", counts2,
"\n", percent2, "%")
complete_student_count3 <- nrow(All_Grades_Students)
incomplete_student_count3 <- student_count - complete_student_count3
counts3 <- c(complete_student_count3, incomplete_student_count3)
percent3 <- round(100 * counts3 / student_count, 1)
labels3 <- paste(c("Data Available", "Data \n Unavailable"),
"\nCount:", counts3,
"\n", percent3, "%")
par(mfrow = c(2, 2), mar = c(1, 4, 1, 4))
pie(counts1, labels = labels1,
main = "Achievement Data Completeness", col = c("lightblue", "lightgray"))
pie(counts2, labels = labels2,
main = "High School Data Completeness", col = c("lightgreen", "lightgray"))
pie(counts3, labels = labels3,
main = "Achievement & High School Completeness", col = c("lightcoral", "lightgray"))
par(mfrow = c(1, 1))
We notice that the proportion of students with complete data for both achievement and high school/college readiness is relatively low compared to the total number of students in the dataset. This highlights the challenges associated with longitudinal studies and the importance of data completeness for robust analyses. Despite these limitations, we are still left with 546 students who have complete data for the longitudinal analysis.
Any results or conclusions beyond this point are based on these 546 students who have complete data for the analysis.
We keep the following variables for the longitudinal analysis:
| Type | Variable Name | Description |
|---|---|---|
| Demographic Variables | `stdntid` | Student ID |
| Class Type & STAR Participation | `gkclasstype` | Class type in kindergarten |
| | `g1classtype` | Class type in grade 1 |
| | `g2classtype` | Class type in grade 2 |
| | `g3classtype` | Class type in grade 3 |
| | `cmpstype` | Class type composite |
| | `cmpsdura` | Duration composite |
| | `yearsstar` | Number of years in STAR |
| Reading & Math Scores | `gktreadss` | Total reading scaled score, SAT, kindergarten |
| | `gktmathss` | Total math scaled score, SAT, kindergarten |
| | `g1treadss` | Total reading scaled score, SAT, grade 1 |
| | `g1tmathss` | Total math scaled score, SAT, grade 1 |
| | `g2treadss` | Total reading scaled score, SAT, grade 2 |
| | `g2tmathss` | Total math scaled score, SAT, grade 2 |
| | `g3treadss` | Total reading scaled score, SAT, grade 3 |
| | `g3tmathss` | Total math scaled score, SAT, grade 3 |
| | `g4treadss` | Total reading scaled score, CTBS, grade 4 |
| | `g4tmathss` | Total math scaled score, CTBS, grade 4 |
| | `g5treadss` | Total reading scaled score, CTBS, grade 5 |
| | `g5tmathss` | Total math scaled score, CTBS, grade 5 |
| | `g6treadss` | Total reading scaled score, CTBS, grade 6 |
| | `g6tmathss` | Total math scaled score, CTBS, grade 6 |
| | `g7treadss` | Total reading scaled score, CTBS, grade 7 |
| | `g7tmathss` | Total math scaled score, CTBS, grade 7 |
| | `g8treadss` | Total reading scaled score, CTBS, grade 8 |
| | `g8tmathss` | Total math scaled score, CTBS, grade 8 |
| High School Performance | `hsgpaoverall` | Overall high school GPA |
| | `hsactcomp` | ACT composite score |
| | `hsactconverted` | SAT score converted to the ACT composite metric |
| | `hsgrdcol` | High school graduation status |
SAT_students <- STAR_Students %>%
filter(hssat == "YES" & hsact == "NO")
ACT_students <- STAR_Students %>%
filter(hsact == "YES" & hssat == "NO")
SAT_ACT_students <- STAR_Students %>%
filter(hssat == "YES" & hsact == "YES")
no_SAT_ACT_students <- STAR_Students %>%
filter(hssat == "NO" & hsact == "NO")
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)
df <- data.frame(
Category = c("SAT Only", "ACT Only", "Both SAT & ACT", "Neither SAT nor ACT"),
Count = c(num_SAT_students, num_ACT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
)
df$Percentage <- round(df$Count / sum(df$Count) * 100, 1)
pie_chart4 <- plot_ly(
data = df,
labels = ~Category,
values = ~Percentage,
type = 'pie',
textinfo = 'percent',
hoverinfo = 'label+percent+text',
text = ~paste0("Count: ", Count),
marker = list(colors = c("lightblue", "lightgreen", "lightcoral", "lightgray")),
title = "SAT and ACT Participation (all students)",
width = 500, height = 400
)
SAT_students <- All_Grades_Students %>%
filter(hssat == "YES" & hsact == "NO")
ACT_students <- All_Grades_Students %>%
filter(hsact == "YES" & hssat == "NO")
SAT_ACT_students <- All_Grades_Students %>%
filter(hssat == "YES" & hsact == "YES")
no_SAT_ACT_students <- All_Grades_Students %>%
filter(hssat == "NO" & hsact == "NO")
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)
df <- data.frame(
Category = c("SAT Only", "ACT Only", "Both SAT & ACT", "Neither SAT nor ACT"),
Count = c(num_SAT_students, num_ACT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
)
df$Percentage <- round(df$Count / sum(df$Count) * 100, 1)
pie_chart5 <- plot_ly(
data = df,
labels = ~Category,
values = ~Percentage,
type = 'pie',
textinfo = 'percent',
hoverinfo = 'label+percent+text',
text = ~paste0("Count: ", Count),
marker = list(colors = c("lightblue", "lightgreen", "lightcoral", "lightgray")),
title = "SAT and ACT Participation (subsetted students)",
width = 500, height = 400
)
# Subset data for STAR_Students
SAT_students <- STAR_Students[STAR_Students$hssat == "YES" & STAR_Students$hsact == "NO", ]
ACT_students <- STAR_Students[STAR_Students$hsact == "YES" & STAR_Students$hssat == "NO", ]
SAT_ACT_students <- STAR_Students[STAR_Students$hssat == "YES" & STAR_Students$hsact == "YES", ]
no_SAT_ACT_students <- STAR_Students[STAR_Students$hssat == "NO" & STAR_Students$hsact == "NO", ]
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)
# Data for first pie chart
counts1 <- c(num_ACT_students, num_SAT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
labels1 <- c("ACT Only", "SAT Only", "Both SAT & ACT", "Neither SAT nor ACT")
percentages1 <- round(counts1 / sum(counts1) * 100, 1)
labels1 <- paste(labels1, "\n", percentages1, "%")
# Subset data for All_Grades_Students
SAT_students <- All_Grades_Students[All_Grades_Students$hssat == "YES" & All_Grades_Students$hsact == "NO", ]
ACT_students <- All_Grades_Students[All_Grades_Students$hsact == "YES" & All_Grades_Students$hssat == "NO", ]
SAT_ACT_students <- All_Grades_Students[All_Grades_Students$hssat == "YES" & All_Grades_Students$hsact == "YES", ]
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
# Data for second pie chart
counts2 <- c(num_ACT_students, num_SAT_students, num_SAT_ACT_students)
labels2 <- c("ACT Only", "SAT Only", "Both SAT & ACT")
percentages2 <- round(counts2 / sum(counts2) * 100, 1)
labels2 <- paste(labels2, "\n", percentages2, "%")
# Set graphical parameters for side-by-side plots
par(mfrow = c(1, 2), mar = c(1, 4, 10, 4))
# Pie chart for STAR_Students
pie(counts1, labels = labels1, col = c("lightgreen", "lightblue", "lightcoral", "lightgray"),
main = "SAT & ACT Participation \n (All Students)")
# Pie chart for All_Grades_Students
pie(counts2, labels = labels2, col = c("lightgreen", "lightblue", "lightcoral"),
main = "SAT & ACT Participation \n (Subsetted Students)")
# Reset graphical parameters
par(mfrow = c(1, 1))
To utilize standardized college entrance exam scores, we need the ACT and SAT scores to be comparable (i.e., on the same scale). ACT composite scores range from 1 to 36, while SAT scores ranged from 400 to 1600 at the time of the study. Although the SAT score range changed over time, at the time of this report it has returned to the same 400–1600 scale.
Converting the scores to match either exam scale will inevitably lead to some loss of information, but it is necessary for meaningful comparisons. To minimize this loss, we will keep the widely taken ACT composite score, `hsactcomp` (reported for 91% of subsetted students), as is, and use the SAT composite converted to the ACT scale (`hsactconverted`) for the analysis.
variables_of_interest <- c(
'stdntid',
'gkclasstype',
'g1classtype',
'g2classtype',
'g3classtype',
'cmpstype',
'cmpsdura',
'yearsstar',
'gktreadss',
'gktmathss',
'g1treadss',
'g1tmathss',
'g2treadss',
'g2tmathss',
'g3treadss',
'g3tmathss',
'g4treadss',
'g4tmathss',
'g5treadss',
'g5tmathss',
'g6treadss',
'g6tmathss',
'g7treadss',
'g7tmathss',
'g8treadss',
'g8tmathss',
'hsgpaoverall',
'hsactcomp',
'hsactconverted',
'hsgrdcol'
)
# Keep variables_of_interest from All_Grades_Students
subsetted_data <- All_Grades_Students %>%
  dplyr::select(all_of(variables_of_interest))  # dplyr:: needed since MASS masks select()
In this section, we will verify the initial analysis results with the subset of students who have complete data for the longitudinal analysis.
# Table of percentages and counts for class type (`cmpstype`)
class_type_table <- subsetted_data %>%
group_by(cmpstype) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))
# Print using kable
head(class_type_table) %>%
kable(caption = "Distribution of Class Types in Subsetted Data") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| cmpstype | count | percentage |
|---|---|---|
| SMALL | 185 | 33.88% |
| REGULAR | 66 | 12.09% |
| AIDE | 255 | 46.70% |
| NA | 40 | 7.33% |
# histogram of high school GPA
gpa_hist <- ggplot(subsetted_data, aes(x = hsgpaoverall)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = "Distribution of High School GPA",
x = "High School GPA",
y = "Frequency") +
theme_minimal()
# Side by side boxplots of high school GPA by class type (`cmpstype`)
gpa_boxplot <- ggplot(subsetted_data, aes(x = cmpstype, y = hsgpaoverall)) +
geom_boxplot(fill = "skyblue",
color = "black") +
labs(title = "High School GPA by Class Type",
x = "Class Type",
y = "High School GPA") +
theme_minimal()
gpa_hist + gpa_boxplot
We will create a new column `college_readiness` that keeps the ACT and SAT (converted to the ACT scale) scores for each student. If a student has taken both exams, we will use the maximum of the two. This combined score will serve as a proxy for college readiness.
subsetted_data <- subsetted_data %>%
mutate(college_readiness = pmax(hsactcomp, hsactconverted, na.rm = TRUE))
# Histogram of college readiness scores
readiness_hist <- ggplot(subsetted_data, aes(x = college_readiness)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
labs(title = "Distribution of College Readiness Scores",
x = "College Readiness Score",
y = "Frequency") +
theme_minimal()
# Side by side boxplots of college readiness scores by class type (`cmpstype`)
readiness_boxplot <- ggplot(subsetted_data, aes(x = cmpstype, y = college_readiness)) +
geom_boxplot(fill = "skyblue",
color = "black") +
labs(title = "College Readiness Scores by Class Type",
x = "Class Type",
y = "College Readiness Score") +
theme_minimal()
readiness_hist + readiness_boxplot
num_of_graduates <- subsetted_data %>%
filter(hsgrdcol == "YES") %>%
nrow()
percent_graduates <- sprintf("%.2f%%", num_of_graduates / nrow(subsetted_data) * 100)
num_of_non_graduates <- subsetted_data %>%
filter(hsgrdcol == "NO") %>%
nrow()
percent_non_graduates <- sprintf("%.2f%%", num_of_non_graduates / nrow(subsetted_data) * 100)
grad_df <- data.frame(
Category = c("Graduated", "Not Graduated"),
Count = c(num_of_graduates, num_of_non_graduates),
Percentage = c(percent_graduates, percent_non_graduates)
)
# Kable
grad_df %>%
kable(caption = "High School Graduation Status of Subsetted Students") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| Category | Count | Percentage |
|---|---|---|
| Graduated | 541 | 99.08% |
| Not Graduated | 5 | 0.92% |
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.4 ggalluvial_0.12.5 tidyr_1.3.1 forcats_1.0.0
## [5] foreign_0.8-86 AER_1.2-14 survival_3.6-4 sandwich_3.1-1
## [9] lmtest_0.9-40 zoo_1.8-12 broom_1.0.7 MASS_7.3-60.2
## [13] car_3.1-3 carData_3.0-5 patchwork_1.3.0 kableExtra_1.4.0
## [17] knitr_1.48 ggthemes_5.1.0 ggplot2_3.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.51 bslib_0.8.0 htmlwidgets_1.6.4
## [5] lattice_0.22-6 crosstalk_1.2.1 vctrs_0.6.5 tools_4.4.1
## [9] generics_0.1.3 tibble_3.2.1 fansi_1.0.6 highr_0.11
## [13] pkgconfig_2.0.3 Matrix_1.7-0 data.table_1.16.4 lifecycle_1.0.4
## [17] compiler_4.4.1 farver_2.1.2 stringr_1.5.1 munsell_0.5.1
## [21] htmltools_0.5.8.1 sass_0.4.9 lazyeval_0.2.2 yaml_2.3.10
## [25] Formula_1.2-5 pillar_1.9.0 jquerylib_0.1.4 cachem_1.1.0
## [29] abind_1.4-8 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
## [33] purrr_1.0.2 labeling_0.4.3 splines_4.4.1 fastmap_1.2.0
## [37] grid_4.4.1 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [41] utf8_1.2.4 withr_3.0.1 scales_1.3.0 backports_1.5.0
## [45] httr_1.4.7 rmarkdown_2.28 evaluate_1.0.0 viridisLite_0.4.2
## [49] rlang_1.1.4 glue_1.8.0 xml2_1.3.6 svglite_2.1.3
## [53] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 systemfonts_1.2.1